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Abstract 

Background: CpG islands are observed in mammals and other vertebrates, generally escape DNA methylation, and 
tend to occur in the promoters of widely expressed genes. Another class of promoter has lower G+C and CpG 
contents, and is thought to be involved in the spatiotemporal regulation of gene expression. Non-vertebrate 
deuterostomes are reported to have a single class of promoter with high-frequency CpG dinucleotides, suggesting 
that this is the original type of promoter. However, the limited annotation of these genes has impeded the large- 
scale analysis of their promoters. 

Results: To determine the origins of the two classes of vertebrate promoters, we chose Ciono intestinolis, an 
invertebrate that is evolutionarily close to the vertebrates, and identified its transcription start sites genome-wide 
using a next-generation sequencer. We indeed observed a high CpG content around the transcription start sites, 
but their levels in the promoters and background sequences differed much less than in mammals. The CpG-rich 
stretches were also fairly restricted, so they appeared more similar to mammalian CpG-poor promoters. 

Conclusions: From these data, we infer that CpG islands are not sufficiently ancient to be found in invertebrates. 
They probably appeared early in vertebrate evolution via some active mechanism and have since been maintained 
as part of vertebrate promoters. 



Background 

Among the 16 DNA dinucleotides, the CpG dinucleo- 
tide is unique in terms of its frequency in genomic 
sequences. This most probably results from the DNA 
methylation system because the DNMT1 and DNMT3 
families of the deuterostomes, such as echinoderms and 
chordates, predominantly target the 5 position of cyto- 
sine residues only in the CpG dinucleotide [1]. Because 
the deamination of 5-methylcytosine is not recognized 
by the DNA repair mechanisms, CpG is rapidly mutated 
to TpG or to its complementary dinucleotide CpA [2]. 
Therefore, deuterostome organisms, except for Oiko- 
pleura dioica [3], display a globally reduced frequency of 
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the CpG dinucleotide compared with its expected fre- 
quency calculated from actual numbers of guanine and 
cytosine residues [4,5]. Interestingly, they also display 
skewed distributions of the CpG dinucleotide across 
their genomes, so that their genomes contain CpG-poor 
and CpG-rich domains [6,7]. In amphibians, avians, and 
mammals, the CpG-rich domains are much shorter than 
the CpG-poor domains and are generally known as CpG 
islands [8]. 

CpG islands are good markers of some classes of 
genes because they are often linked to the promoters of 
those genes [9]. In most cases, CpG islands escape DNA 
methylation, which suppresses gene expression in gen- 
eral, in almost every tissue [10] and function as part of 
the gene promoter [11]. Hence, CpG islands tend to be 
related to ubiquitously or broadly expressed genes, 
whereas promoters that lack a CpG island are involved 



o 



© 201 1 Okamura et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons 
BiolVlGCl C6ntr"cll Attribution License (http://creativecommons.0rg/licenses/by/2.O), which permits unrestricted use, distribution, and reproduction in 
any medium, provided the original work is properly cited. 



Okamura et al. BMC Genomics 201 1, 12(Suppl 3):S7 
http://www.biomedcentral.eom/1 471 -21 64/1 2/S3/S7 



Page 2 of 1 1 



in the spatiotemporal regulation of the genes [12]. It is 
important to note that mammalian promoters can be 
thus divided into the two distinct classes, not only struc- 
turally but also functionally. In the human genome, 
CpG-rich promoters or CpG island promoters are domi- 
nant, occurring more than twice as often as CpG-poor 
promoters [13,14]. 

As anticipated for a vertebrate taxon, CpG island pro- 
moters were indeed experimentally identified in fish by 
an analysis of transcription start sites (TSSs) [15]. The 
presence of two classes of promoters in fish, amphi- 
bians, reptiles, avians, and mammals has since been con- 
firmed in silico [16]. In that study, the authors analysed 
the distributions of the normalized CpG contents (the 
ratio of the observed CpG number to the expected CpG 
number, called the "CpG score" hereunder) of the pro- 
moter sequences in six vertebrate genomes and showed 
bimodal distributions for all of them. Furthermore, the 
structural bimodality was shown to correspond to func- 
tionally distinct classes of genes. The authors also ana- 
lysed three invertebrate promoters, of one sea urchin 
and two ascidian (sea squirt) species, and found unimo- 
dal distributions of high CpG scores, unlike the distribu- 
tions observed in the vertebrate promoters. This led 
them to propose that the vertebrate promoter classes 
differentiated at an early stage of vertebrate evolution, 
with global DNA methylation and subsequent deamina- 
tion. This is basically consistent with the formerly 
accepted evolutionary hypothesis of CpG islands [17,18]. 

If this hypothesis is true, do the non-vertebrate deu- 
terostomes {e.g. echinoderms, lancelets, and ascidians) 
have CpG islands in their genomes? Currently, the pre- 
sence of CpG islands in invertebrate animals is unclear. 
It is possible to apply any criteria that define a CpG 
island to their genomic sequences and identify some 
islands. Nevertheless, we were interested in determining 
whether there are CpG island-like sequences in inverte- 
brate genomes that are associated with transcription 
initiation, and how and when these sequences appeared 
during evolution. 

To address this issue, we identified the TSSs of Ciona 
intestinalis by a combination of the oligo-capping 
method [19] and massive-scale cDNA sequencing 
(RNA-seq, specifically TSS-seq) [20]. The widely used 
model organism C. intestinalis is an ascidian tunicate, 
which although an invertebrate, is most closely related 
to the vertebrates [21]. Although the ascidian evolved 
from the last common ancestor of the ascidians and ver- 
tebrates, it can be presumed to retain many more fea- 
tures of the ancestral organism than do extant 
vertebrates. It is well known that the enrichment of the 
CpG dinucleotides in CpG island promoters is maxi- 
mum in TSSs [12,13], so TSSs constitute candidate 
regions in which CpG island promoters or CpG island- 



like sequences might occur in the invertebrate genome. 
Incidentally, this approach that targets TSSs also cir- 
cumvents the confusion arising from CpG-rich 
sequences that are indifferent to transcription initiation. 
In the computational study mentioned above, promoter 
regions were defined using the RefSeq database, which 
is a curated collection of publicly available nucleotide 
sequences [16]. It is likely that many of the cDNA 
entries are truncated or incomplete at the 5' end which 
makes the definition of their promoter regions unreli- 
able. More importantly, the TSSs of approximately half 
of all ascidian genes can hardly be determined because 
of mRNA 5'-leader trans-splicing [22-24]. The 5' ends of 
those primary transcripts, termed the outron, are dis- 
carded via the trans-splicing reaction. This fact is easily 
exemplified by downstream operonic genes, which are 
resolved from their primary transcripts by trans-splicing 
[25]. Although it is almost impossible to know TSSs of 
them, it is essential to be distinguished from non- trans- 
spliced genes and to know the most 5' end position of 
the processed transcripts. Analyzing these data, we 
determined the structural features of the ascidian pro- 
moters and compared them with human promoters to 
identify and characterize their similarities and differ- 
ences. To extend our understanding of gene regulation 
in higher eukaryotes, we undertook to clarify the origin 
of CpG islands and the two classes of vertebrate 
promoters. 

Results 

In this study, we chose C. intestinalis embryos at the 
mid-tailbud stage (Additional file 1: Figure SI) for the 
genome-wide identification of TSSs. Since whole 
embryos still retaining the notochord contain a wide 
range of cell types, we may cover a large part of ascidian 
promoters. Total RNA was extracted from embryos and 
was subjected to oligo capping in which the 5' cap of 
the mRNA was replaced with a synthetic RNA oligonu- 
cleotide (see Methods). After cDNA synthesis and sub- 
sequent PCR, we undertook massively parallel 
sequencing using the Illumina Genome Analyzer. We 
obtained two data sets containing fragments of different 
lengths 36 nt or 48 nt. Because we read the sequences 
from the 3' end of the RNA oligonucleotide, all the 
sequences obtained should start with GG at their 5' 
ends (see Methods). We recovered only the reads that 
started with GG, but then trimmed the GG from those. 
Although the genie sequences were trimmed by two 
nucleotides, this protocol eliminated dubious sequences 
that do not start with the dinucleotide. We also elimi- 
nated sequences containing undetermined nucleotides 
other than T, C, A, and G, yielding 4,247,902 reads of 
34 nt and 4,770,608 reads of 46 nt. To detect the spliced 
leader (SL) of C. intestinalis, we considered, in addition 
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to the canonical 16-nt sequence, all similar sequences, 
allowing a 1-nt mismatch or indel and some previously 
reported variants [24]. The 34-nt data set consisted of 
1,849,849 non-trans-spliced and 2,398,053 trans-spliced 
reads and the 46-nt data set consisted of 2,052,230 non- 
trans-spliced and 2,718,378 trans -spliced reads. Even if 
some SL-related 5' mRNA sequences escaped from 
being detected by this process, it is unlikely such reads 
would map to the genome in the following step. Map- 
ping or alignment to the KH assembly [25] was per- 
formed as described in the Methods. Sequences that 
mapped to more than one locus (multiple hits) were not 
considered further. The numbers of mapped 34-nt and 
46-nt reads were 1,017,283 {non-trans-spliced), 
1,932,570 {trans-spliced), 939,092 (non-£rans-spliced), 
and 1,237,720 {trans-spliced), respectively. Because the 
original 5'-segment of a pre-mRNA is discarded during 
the trans-splicing reaction, mature trans-spliced mRNAs 
do not contain the initial segment of the primary tran- 
script and therefore lack the information required to 
precisely identify TSS [22]. Therefore, we decided to 
mainly examine non-£ra#s-spliced reads to provide valid 
data for the promoter analyses presented here. The 
genomic positions to which the 5' ends of the reads 
were aligned were defined as TSSs. The read counts 
were converted to values in parts per million (ppm) for 
transcript abundance estimation and normalization, and 
both of the short and long data sets were merged. The 
TSSs, which are generally scattered around a promoter 
region [26], were then clustered into 100-bp bins to 
define each promoter. In other words, two reads located 
more than 100 bp apart without any other reads 
between them were considered to be regulated by two 
separate promoters [26]. In this clustering process, TSSs 
represented by reads occurring at less than 0.5 ppm 
were not considered. However, once promoters were 
defined, all the TSSs in the bins were counted to esti- 
mate the abundance of transcripts from each cluster. 
Because we can assume that every cell contains approxi- 
mately one million mRNA molecules, we can consider 
the values in ppm as copy numbers of the transcripts in 
a cell [27]. We set a threshold of 1.0 ppm to exclude 
transcriptional noise. As a result, we obtained 6312 and 
8753 promoters for non-trans-spliced and trans-spliced 
genes, respectively, that could be considered active in 
the tailbud embryos. The most frequent TSS in each 
promoter (and if there were several, the most upstream 
one) was selected as its representative TSS. If the corre- 
sponding genes were found in the KH gene model [25], 
the gene names were also tabulated (Additional file 2: 
Tables SI and S2). Note that one gene can have several 
alternative promoters. 

The initiator (Inr) motif, which spans the TSS, is the 
most commonly occurring sequence motif observed in 



metazoans [28]. Its consensus sequence between mam- 
mals and fruit fly is pyrimidine-purine (YR), where R 
corresponds to the exact TSS [29]. By aligning core pro- 
moter sequences of all the 6312 non-trans-spliced tran- 
scripts with consideration of their orientation, we 
confirmed that the ascidian promoters also follow the 
YR consensus, suggesting that the sequence processes 
described above are plausible (Figure 1). In this figure, 
all the representative TSSs are aligned at position 0. The 
next positions upstream and downstream are designated 
-1 and +1, respectively. This notation is used in the rest 
of the present paper. Another alignment of all the 8753 
trans-spliced transcripts is also shown. In this case, 
however, the position 0 means the most 5' end of the 
transcripts after removing SLs. 

We then examined the genome-wide distributions of 
the CpG scores in both the whole genome and the pro- 
moters of non-trans-spliced transcripts, using a sliding 
window of 1 kb. To compare them with the correspond- 
ing vertebrate distributions, we performed the same ana- 
lysis using the human genome (Figure 2). We defined a 
sequence fragment from -499 to +500 as a promoter. A 
similar analysis of the CpG-score distributions has 
already been reported [16]. Although the definitions of 
the promoter sequences differ in these studies, we 
obtained fundamentally identical results. The human 
genome is globally methylated and CpG dinucleotides 
occur in bulk at only one-fifth of the expected frequency 
[17]. In contrast, the ascidian genome contains approxi- 
mately equal amounts of methylated and unmethylated 
regions, which may have resulted in CpG-poor and 
CpG-rich sequences, respectively [7,14]. Intriguingly, the 
ascidian and human promoters show unimodal and 
bimodal distributions, respectively. The latter distribu- 
tion indicates that the human has two classes of promo- 
ters, CpG-poor and CpG-rich. The CpG-rich promoters 
can be considered to contain a CpG island. In contrast, 
the ascidian promoters generally tend to have high CpG 
scores and exhibit a unimodal distribution. This obser- 
vation led to the hypothesis that human CpG-poor pro- 
moters emerged with the deamination of methylated 
CpG dinucleotides in CpG island promoters [16]. Using 
our experimental data, we intended to substantiate this 
idea and define the CpG islands in the invertebrate 
genome. 

We excised 4-kb promoter sequences (2 kb upstream 
and 2 kb downstream from each representative TSS) of 
the ascidian non-trans-spliced, and human CpG-poor, 
and CpG-rich promoters, and aligned them with consid- 
eration of the transcriptional orientation to determine 
the overall changes in the CpG scores and G+C contents 
in the vicinity of the TSSs (Figure 3). We used Database 
of Transcription Start Sites (DBTSS) to select the human 
promoter sequences [27]. The methodological details 



Okamura et al. BMC Genomics 201 1, 12(Suppl 3):S7 
http://www.biomedcentral.eom/1 471 -21 64/1 2/S3/S7 



Page 4 of 1 1 



such as grouping human CpG-poor and CpG-rich pro- 
moters are described in the Methods. Our results con- 
firmed that the ascidian promoters tended to have high 
CpG score and G + C contents around TSS, as was 
observed in the human promoters. However, judging 
from the heights and extents (widths) of the peaks 
around the TSSs, the ascidian promoters seem more 
similar to the human CpG-poor promoters than to the 
human CpG island promoters (Figure 3B). Although the 
ascidian TSSs exhibited quite high CpG score, this fact 
does not necessarily mean that they have high frequency 
of the CpG dinucleotide (Figure 3A). The low content of 
G + C underestimated the expected number of CpG, 
which in result increased the ratio of the observed over 
expected numbers of the dinucleotide, i.e. CpG score. 
Hence, we defined "CpG content" to show its plain den- 
sity (see Methods) and drew the changes (Figure 3C). 
The heights and extents were comparable between the 
ascidian and CpG-poor promoters and their contents 
were regularly lower than the expected content for any 
dinucleotide, 0.0625 or 1/16. In addition to CpG, we also 
analysed the changes in all the other dinucleotide scores 
in the vicinity of the TSSs (Additional file 3: Figure S2). 
Distinct features were also observed at the TSSs for all 
these dinucleotide scores. This information may possibly 



be used to predict the locations of promoters and their 
corresponding genes. 

Among the dinucleotides, the local frequencies of TpG 
and CpA can be used as indicators of DNA methylation 
levels [4]. We calculated the TpG and CpA scores for 1- 
kb promoter sequences and charted their distributions 
for the three classes of promoters (Figure 4). All the six 
histograms showed a unimodal bell-shaped distribution, 
e.g. p < 10~ 15 by Kolmogorov-Smirnov test for Figure 4A, 
indicating that they were formed by promoters having 
homogeneous characters in terms of the dinucleotide 
scores. Whereas the distributions of the human CpG 
island promoters are centered at the value of 1.0, the dis- 
tributions of the ascidian and CpG-poor promoters are 
shifted to higher-score regions, where observed numbers 
of the deaminated dinucleotides are larger than their 
expected numbers. It is more likely that deamination of 
CpG sites are common. The high frequency of deamina- 
tion in the ascidian and CpG-poor promoters suggests 
that these regions are relatively methylated unlike CpG 
islands. Because mutations in somatic cells have not been 
transmitted evolutionarily, what we observed here is the 
result occurred in germ line. The DNA methylation 
could be tissue-, stage-, or cell-type-specific and play a 
role in spatiotemporal gene regulation. 




B 



Relative position from TSS 




Relative position from 5' end 

Figure 1 Sequence logo around the ascidian transcription start sites. (A) All the 6312 promoter sequences were aligned around the 
representative TSSs with consideration of their transcriptional orientation. The pyrimidine-purine (YR) consensus was also observed in the 
ascidian genome. The TSS is located at position 0. (B) A similar alignment of the 5' ends of the first exons of all the 8753 trans-spliced transcripts 
is also shown. In this case, the position 0 means the most 5' end of the exons. Splice acceptor sequences, which are replaced with SLs in the 
trans-splicing reaction, can be observed. The whole replaced sequences are also known as outrons. 
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Figure 2 Distributions of the CpG score frequencies in 1-kb genomic fragments and promoters. The histograms show the distributions in 
the ascidian genome (A), the human genome (B), ascidian promoters (C), and the promoters of human protein-coding genes (D). Vertical dotted 
lines indicate a CpG score of 0.721, around which human CpG island promoters are dispersed [5]. The distributions of the overall genome scores 
differ between human and ascidian (unimodal vs. bimodal, respectively), but the distributions of the promoter scores of the human and ascidian 
sequences differ obversely (bimodal vs. unimodal, respectively). 

V J 



Lastly, we examined the usage of the four YR dinu- 
cleotides (CpA, CpG, TpA, and TpG) at the YR-consen- 
sus sites (positions -1 and 0). This analysis was 
performed using representative TSSs, which have a one- 
to-one correspondence with promoters. As noted above, 
these dinucleotides are preferentially used as TSSs in a 
wide range of animals [29]. However, the frequencies of 
the dinucleotides are not equivalent (Figure 5). CpA is 
the most commonly observed as the representative TSS 
in both ascidian and human genomes. The second pre- 
ference is for CpG in human CpG island promoters. 



The usages of CpG are 4.2%, 3.5%, and 18.1% in the 
ascidian, human CpG-poor, and CpG island promoters, 
respectively. Although the ascidian promoters tended to 
exhibit high CpG scores (Figure 2), CpG seems to be 
used rarely as the transcription initiation point. 

Discussion 

The CpG island promoters seen in vertebrates are 
believed to have emerged from the deamination of other 
regions [17]. Therefore, it is plausible that the appear- 
ance of the two classes of vertebrate promoters is also a 
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consequence of deamination, following the global DNA 
methylation that occurred early in vertebrate evolution 
[16,30]. Specific sequence motifs that function as tran- 
scription factor binding sites might have retained some 
CpG -rich sequences from the methylation and mutation 
to form CpG island promoters [31-33]. To confirm this 
hypothesis, we used a large-scale experimental approach 
to identify the TSSs of C. intestinalis. On the basis of 
our TSS information, we then examined the ascidian 
promoter sequences. The fact that the CpG scores, i.e. 
the ratios of the observed CpG number to the expected 
CpG number, tended to be quite high in the vicinity of 
the ascidian TSSs led us speculate CpG island promo- 
ters [16]. However, it had to be noted that the G+C and 
CpG contents are low. When we applied the most con- 
ventional and conservative CpG island definition [8] to 
the promoters, only 3.5% (223 out of the 6313 promo- 
ters) meet the criteria. This is attributable to the fact 
that the ascidian G+C content, approximately 0.36, is 
much lower than the G+C criterion of 0.5 (Figure 3B). 
Even at TSSs, the average ascidian G+C content is 
approximately 0.4 at the most. Besides, the ascidian 
CpG score is much higher than the criterion of 0.6 (Fig- 
ure 3A). If we try to define new criteria for the ascidian 
genome, the difference in the values for the TSSs and 
background sequences is much smaller than that 
observed for the human genome. The unique feature of 
the non-vertebrate deuterostome genomes, Le. the pre- 
sence of comparable amounts of CpG-poor and CpG- 
rich domains [7], also hinders us in defining CpG 
islands in these animals. 

Contrary to our initial expectation, we failed to iden- 
tify CpG island-like promoters in the invertebrate gen- 
ome. Instead, we found that the general features of 
ascidian promoters are similar to those of CpG-poor 
vertebrate promoters rather than to CpG island promo- 
ters. It is reasonable to consider CpG-poor promoters 
more ancient because they are found in a wide variety 
of eukaryotes [29]. Conversely, CpG island promoters 
must have appeared in an early stage of vertebrate evo- 
lution, derived by some mechanism, and have been 
adopted as important cis regulatory elements in descen- 
dant species. Because the CpG score is just the ratio of 
the observed to the expected numbers of dinucleotides, 
a high score does not necessarily mean a high frequency. 
We defined and used "CpG content", which showed a 
substantially different feature from CpG score in the 
ascidian genome (Figure 3C). Note that the CpG score 
and CpG content profiles are dissimilar and similar in 
the ascidian and human genomes, respectively. The 
CpG content will also be important to scrutinize gen- 
omes especially of various animals other than mammals. 
It is unlikely that the conventional CpG island defini- 
tions using only CpG score, G+C content, and length 
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Figure 3 Changes in the CpG scores (A), G+C contents (B), and 
CpG contents around the TSS. The local CpG score, G+C content, 
and CpG content at each position within a 4-kb region, with a moving 
window size of 100 bp, were averaged for the ascidian, human CpG- 
poor, and CpG island promoters. The horizontal axes indicate the 
position relative to the representative TSS. Horizontal dotted lines 
indicate the conventional criteria for vertebrate CpG islands (A, B) and 
the expected content of each dinucleotide, 1/16 (C). 
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function in invertebrate genomes. Because the deamina- 
tion of methylated CpG sites cannot explain the sub- 
stantial increase in the CpG and G+C contents in the 
vicinity of vertebrate TSSs, we must search for and 
examine active mechanisms that may have given rise to 
CpG islands. The biased gene conversion [18,34], the 
condensation of CpG-rich protein-coding sequences by 
retrotransposition [35], and the expansion of elements 
containing the CpG dinucleotide [36] are potential 
molecular mechanisms. The fact that CpG islands are 
not conserved satisfactory among species [8] may indi- 
cate that CpG island loss and gain are active phenom- 
ena, occurring up to the present time, even in extant 
vertebrates. 

The number of C. intestinalis genes is reported to be 
15,254 in the KH gene model [25]. Whereas series of 
operonic genes have single promoters, alternative pro- 
moters have been reported for a large number of genes. 
The number of all RNA polymerase II promoters, 
including those of non-coding transcripts, may exceed 
20,000. This study targeted the promoters that are active 
in the embryos. Although we believe that the 6312 pro- 
moters analysed here may well represent most of them, 
we eagerly await techniques with which to identify the 
TSSs of £rans-spliced genes. Utilizing our data, the TSS 
of the Tnl gene was recently identified as the first case 
for Ciona £rans-spliced genes [36]. CpG island promo- 
ters cannot be seen at least for this gene. 

Conclusions 

We have experimentally identified and characterized 
ascidian promoter sequences as the primordial type of 
vertebrate promoter. As far as we know, this is the first 
case for non-vertebrate deuterostomes. The sequences 
near TSSs tend to exhibit high CpG score and high G 
+ C content, but their level and extent are actually 
restricted. Furthermore, the promoter sequences seem 
to be at least partially methylated. It is unlikely that they 
were the original type of vertebrate CpG island promo- 
ters. Rather than global methylation and subsequent 
deamination, some active mechanisms and maintaining 
mechanisms have presumably been required to form 
such a long and CpG-condensed region in vertebrate 
animals. 

The genomes of more than 50 vertebrate species have 
been sequenced and even more genomes will be 
sequenced in the future [38]. Now that an ascidian gen- 
ome has been shown to lack CpG islands that function 
in promoter sequences, our curiosity is directed to pri- 
mitive vertebrates, such as agnathans. It could be super- 
ficial to make a strong conclusion at this point. The 
searching for primitive organisms with CpG island pro- 
moters in order to determine the origin of CpG islands 
will certainly extend our understanding of the 



sophisticated roles of DNA methylation in higher eukar- 
yotes [39-41]. 

Methods 

RNA extraction, oligo capping, and RNA-seq with the 
lllumina Genome Analyzer 

More than 200 [ig of total RNA was isolated from whole 
mid-tailbud-stage ascidian embryos (12-hour-old 
embryos), using ISOGEN (Nippon Gene) according to 
the manufacturer's protocol. The RNA was subjected to 
oligo-capping method [19]. In short, after successive 
treatments with bacterial alkaline phosphatase (TaKaRa) 
and tobacco acid pyrophosphatase (Ambion), the treated 
RNA was ligated to an RNA oligonucleotide with the 
sequence 5'- AAU GAU ACG GCG ACC ACC GAG 
AUC UAC ACU CUU UCC CUA CAC GAC GCU 
CUU CCG AUC UGG -3' using T4 RNA ligase 
(TaKaRa). After treatment with DNase I, the poly(A) + 
RNA was selected and used as the template for the first- 
strand cDNA synthesis with the primer 5'- CAA GCA 
GAA GAC GGC ATA CGA NNN NNN C -3\ The 
cDNA was then used as the template for PCR with the 
primers 5'- AAT GAT ACG GCG ACC ACC GAG -3' 
and 5'- CAA GCA GAA GAC GGC ATA CGA -3\ The 
products were size fractionated by polyacrylamide gel 
electrophoresis. Approximately 1 ng of the 150-200-bp 
fraction was used for the sequencing reactions on the 
lllumina Genome Analyzer (Solexa). Both 36-cycle and 
48-cycle sequencing reactions were performed on the 
same samples. The DNA sequences have been deposited 
in [DDBJ Sequence Read Archive: DRA000156]. 

Sequence data analysis 

lllumina Pipeline (GAPipeline 1.0) was used to extract 
the sequenced reads from the image data. The spliced 
leaders (SLs) in the trans -spliced sequences were 
replaced with splice acceptor sequence "ag" for the sub- 
sequent mapping. The sequences were aligned to the 
KH assembly [25] using SeqMap [42] for the 36-cycle 
reads, or to BLAT [43] for the 48-cycle reads, because 
of the high rate of ds-splicing. Because of the highly 
polymorphic genie features of this organism [44], we 
used a 90% match criterion, including insertions and 
deletions. If the 5' end of a read was not aligned to the 
genome, the read was eliminated from the analysis. Mul- 
tiple hits were removed, and only single best hits were 
considered for the subsequent analyses. Sequence logos 
were drawn with WebLogo 2.8.2 (http://weblogo.berke- 
ley.edu/). The CpG score was defined as CpG * N I CI 
G with C, G, CpG, and N observed numbers of C, G, 
and CpG and the fixed window size, respectively. The 
CpG content defined in the present study was CpG I (N 
- 1). The assembly used for the human genome was 
UCSC hgl8. To select human CpG-poor (CpG score < 
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TpG score CpA score 



Figure 4 Distributions of TpG and CpA scores in 1-kb promoters. The left histograms show the distributions of TpG scores in the ascidian 
(A), human CpG-poor (B), and CpG island promoters (C). The right histograms show the distributions of CpA scores in the ascidian (D), human 
CpG-poor (E), and CpG island promoters (F). Vertical dotted lines indicate the positions of mean, i.e. 1.14, 1.24, 1.00, 1.14, 1.24, and 0.99 for (A)-(F), 
respectively The score 1.0 means that the observed and expected numbers of the dinucleotide are equal, suggesting no methylation effect in 
this case. 

V J 
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Figure 5 Usage of the four YR dinucleotides at the TSS. The proportional use of each YR dinucleotide at the representative TSSs was 
calculated for the three promoter groups: ascidian (A), human CpG-poor (B), and human CpG island promoters (C). Dinucleotides other than 
CpA, CpG, TpA, and TpG were ignored in this analysis. Whereas CpG is the second most often used dinucleotide in human CpG-rich promoters, 
the dinucleotide is used least in ascidian and human CpG-poor promoters. 
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0.5) and CpG-rich promoters (CpG score > 0.6), we 
used DBTSS 6.0 (http://dbtss.hgc.jp/) and calculated 
CpG scores in 200-bp regions around representative 
TSSs [45]. The analysis was limited to protein-coding 
genes, but all the alternative promoters deposited in the 
database were included (out of all 101,436 promoters, 
32,122 were for protein-coding genes). The numbers of 
CpG-poor and CpG-rich promoters were 18,034 and 
12,493, respectively. Dinucleotides other than pyrimi- 
dine-purine (YR) were not considered in the analysis of 
the usage of the YR motif. The total numbers of YR 
motifs at TSSs were 3,610, 8,162, and 8,610 for ascidian, 
human CpG-poor, and human CpG island promoters, 
respectively. All the sequence analyses were performed 
with Perl scripts, which are available upon request. 

Additional material 
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